Corpus linguistics and naive discriminative learning
Abstract
Three classifiers from machine learning (the generalized linear mixed model, memory-based learning, and support vector machines) are compared with a naive discriminative learning classifier, derived from basic principles of error-driven learning characterizing animal and human learning. Tested on the dative alternation in English, using the Switchboard data of Bresnan, Cueni, Nikitina, and Baayen (2007), naive discriminative learning emerges with state-of-the-art predictive accuracy. Naive discriminative learning offers a unified framework for understanding the learning of probabilistic distributional patterns, for classification, and for a cognitive grounding of distinctive collexeme analysis.

According to Gries (2011), linguistics is a distributional science exploring the distribution of elements at all levels of linguistic structure. He describes corpus linguistics as investigating the frequencies of occurrence of such elements in corpora, their dispersion, and their co-occurrence properties. Although this characterization of present-day corpus linguistics is factually correct, the aim of the present paper is to argue that corpus linguistics should be more ambitious, and that for a better understanding of the data its current descriptive approach may profit from complementation with cognitive computational modeling.

Consider the dative alternation in English. Bresnan et al. (2007) presented an analysis of the dative alternation in which the choice between the double object construction (Mary gave John the book) and the prepositional object construction (Mary gave the book to John) was modeled as a function of a wide range of predictors, including the accessibility, definiteness, length, and animacy of theme and recipient (see also Ford & Bresnan, 2010). A mixed-effects logistic regression model indicated that their variables were highly successful in predicting which construction is most likely to be used, with approximately 94% accuracy.

The statistical technique used by Bresnan and colleagues, logistic regression modeling, is but one of many excellent statistical classifiers currently available to the corpus linguist, such as memory-based learning (MBL, Daelemans & Bosch, 2005), analogical modeling of language (AML, Skousen, 1989), support vector machines (SVM, Vapnik, 1995), and random forests (RF, Strobl, Malley, & Tutz, 2009; Tagliamonte & Baayen, 2010). The mathematics underlying these techniques varies widely, from iterative optimization of the model fit (regression) and nearest-neighbor similarity-based inference (memory-based learning) to kernel methods (support vector machines) and recursive conditioning with subsampling (random forests). All these statistical techniques tend to provide a good description of the speaker-listener's knowledge, but it is unlikely that they provide a good characterization of how speaker-listeners actually acquire and use this knowledge. Of these four techniques, only memory-based learning, as a computational implementation of an exemplar-based model, may arguably reflect human performance.

A first question addressed in the present study is whether these different statistical models provide a correct characterization of the knowledge that a speaker has of how to choose between these two dative constructions. A statistical model may faithfully reflect a speaker's knowledge, but it is also conceivable that it underestimates or overestimates what native speakers of English actually have internalized.
This question will be addressed by comparing statistical models with a model based on principles of human learning. A second question concerns how frequency of occurrence and co-occurrence frequencies come into play in human classification behavior as compared to machine classification. For machine classification, we can easily count how often a linguistic element occurs, and how often it co-occurs with other elements. The success of machine classification in reproducing linguistic choice behavior suggests that probabilities of occurrence are somehow available to the human classifier. But is frequency of (co-)occurrence available to the human classifier in the same way as to the machine classifier?

Simple frequency of occurrence information is often modeled by means of some ‘counter in the head’, implemented in cognitive models in the form of ‘resting activation levels’, as in interactive activation models (McClelland & Rumelhart, 1981; Coltheart, Rastle, Perry, Langdon, & Ziegler, 2001; Van Heuven, Dijkstra, & Grainger, 1998), in the form of frequency-based rankings (Murray & Forster, 2004), as a unit's verification time (Levelt, Roelofs, & Meyer, 1999), or, in the Bayesian approach of Norris, straightforwardly as a unit's long-term a priori probability (Norris, 2006; Norris & McQueen, 2008). A potential problem that arises in this context is that large numbers of such ‘counters in the head’ are required, not only for simple and complex words, but also for hundreds of millions of word n-grams, given recent experimental results indicating human sensitivity to n-gram frequency (Arnon & Snider, 2010; Tremblay & Baayen, 2010). Moreover, given the tendency of human memory to merge, or blend, previous experiences, it is rather unlikely that the human classifier has at its disposal exactly the same frequency information that we make available to our machine classifiers.

To address these questions, the present study explores what a general model of human learning may offer corpus linguistics as a computational theory of human classification.

Table 1: Example instance base for discriminative learning with the Rescorla-Wagner equations, with as cues the definiteness and pronominality of the theme, and as outcome the construction (double object, NP NP, versus prepositional object, NP PP).

Frequency   Definiteness of Theme   Pronominality of Theme   Construction
    7       definite                non-pronominal           NP NP
    1       definite                pronominal               NP NP
   28       indefinite              non-pronominal           NP NP
    1       indefinite              pronominal               NP NP
    3       definite                non-pronominal           NP PP
    4       definite                pronominal               NP PP
    6       indefinite              non-pronominal           NP PP
    0       indefinite              pronominal               NP PP

Naive Discriminative Learning

In psychology, the model of Wagner and Rescorla (1972) is one of the most influential and fruitful theories of animal and human learning (Miller, Barnet, & Grahame, 1995; Siegel & Allan, 1996). Its learning algorithm is closely related to the connectionist delta-rule (cf. Gluck & Bower, 1988; Anderson, 2000) and to the Kalman filter (cf. Dayan & Kakade, 2001), and can be viewed as an instantiation of a general probabilistic learning mechanism (see, e.g., Chater, Tenenbaum, & Yuille, 2006; Hsu, Chater, & Vitányi, 2010).

The Rescorla-Wagner equations

Rescorla and Wagner formulated a set of equations that specify how the strength of association of a cue in the input to a given outcome is modified by experience.
By way of example, consider the instance base in Table 1, which specifies for the four combinations of the pronominality and definiteness of the theme (the book in John gave the book to Mary) which construction is used (the double object construction, NP NP, or the prepositional object construction, NP PP). The eight possible combinations occur with different frequencies, modeled on the data of Bresnan et al. (2007). The cues in this example are the values for definiteness and pronominality. The outcomes are the two constructions. There are in all 50 learning trials, more than half of which pair an indefinite non-pronominal theme with the double object construction (e.g., John gave Mary a book).

The Rescorla-Wagner equations implement a form of supervised learning. It is assumed that the learner predicts an outcome given the available cues. Depending on whether this prediction is correct, the weights (association strengths) from the cues to the outcomes are adjusted such that on subsequent trials prediction accuracy will improve.

Let PRESENT(C, t) denote the presence of a cue C (definiteness, pronominality) and PRESENT(O, t) the presence of an outcome O (construction) at time t, and let ABSENT(C, t) and ABSENT(O, t) denote their absence at time t. The Rescorla-Wagner equations specify the association strength V_i^{t+1} of cue C_i with outcome O at time t+1 by means of the recurrence relation

V_i^{t+1} = V_i^t + \Delta V_i^t,    (1)

which simply states that the association strength at time t+1 is equal to its previous association strength at time t modified by some change in association strength \Delta V_i^t, defined as

\Delta V_i^t =
\begin{cases}
0 & \text{if } \mathrm{ABSENT}(C_i, t), \\
\alpha_i \beta_1 \big( \lambda - \sum_{\mathrm{PRESENT}(C_j, t)} V_j \big) & \text{if } \mathrm{PRESENT}(C_i, t) \text{ and } \mathrm{PRESENT}(O, t), \\
\alpha_i \beta_2 \big( 0 - \sum_{\mathrm{PRESENT}(C_j, t)} V_j \big) & \text{if } \mathrm{PRESENT}(C_i, t) \text{ and } \mathrm{ABSENT}(O, t).
\end{cases}    (2)

Standard settings for the parameters are \lambda = 1, \alpha_1 = \alpha_2 = 0.1, \beta_1 = \beta_2 = 0.1. If a cue is not present in the input, its association strength is not changed. When the cue is present, the change in association strength depends on whether or not the outcome is present. Association strengths are increased when cue and outcome co-occur, and decreased when the cue occurs without the outcome. Furthermore, when more cues are present simultaneously, adjustments are more conservative. In this case, we can speak of cue competition.

Figure 1 illustrates, for a random presentation of the 50 learning trials, how the association strengths (or weights) from cues to outcomes develop over time. As indefinite non-pronominal themes dominate the instance base, and strongly favor the double object construction, the weights from the cues indefinite and non-pronominal to the construction NP NP increase steadily during the learning process.

Figure 1. Development of the association strengths (weights) from cues (definite/indefinite/pronominal/non-pronominal) to outcomes (NP NP/NP PP) given the instance base summarized in Table 1. The 50 instance tokens were presented for learning once, in random order.
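The simulation underlying Figure 1 can be reproduced in a few lines of R. The sketch below is a minimal illustration, not the code of the original study; only base R is assumed, and the object names (instances, trials, V) are introduced here for expository purposes.

```r
# Rescorla-Wagner updates (Equations 1-2) for the instance base of Table 1.
# Cues: definiteness and pronominality of the theme; outcomes: the two constructions.
cues     <- c("definite", "indefinite", "pronominal", "nonpronominal")
outcomes <- c("NP NP", "NP PP")

instances <- data.frame(
  definiteness  = c("definite", "definite", "indefinite", "indefinite",
                    "definite", "definite", "indefinite", "indefinite"),
  pronominality = c("nonpronominal", "pronominal", "nonpronominal", "pronominal",
                    "nonpronominal", "pronominal", "nonpronominal", "pronominal"),
  construction  = rep(c("NP NP", "NP PP"), each = 4),
  frequency     = c(7, 1, 28, 1, 3, 4, 6, 0),
  stringsAsFactors = FALSE)

# expand to the 50 individual learning trials and present them in random order
trials <- instances[rep(1:nrow(instances), instances$frequency), 1:3]
trials <- trials[sample(nrow(trials)), ]

# association strengths: one row per cue, one column per outcome, starting at zero
V <- matrix(0, length(cues), length(outcomes), dimnames = list(cues, outcomes))

lambda <- 1; alpha <- 0.1; beta1 <- 0.1; beta2 <- 0.1   # standard parameter settings

for (t in 1:nrow(trials)) {
  present <- c(trials$definiteness[t], trials$pronominality[t])  # cues on this trial
  for (o in outcomes) {
    Vtotal <- sum(V[present, o])             # summed activation of the cues present
    if (o == trials$construction[t]) {
      V[present, o] <- V[present, o] + alpha * beta1 * (lambda - Vtotal)  # outcome present
    } else {
      V[present, o] <- V[present, o] + alpha * beta2 * (0 - Vtotal)       # outcome absent
    }
  }
}
round(V, 2)   # weights after one random pass; compare the trajectories in Figure 1
```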
The equilibrium equations for the Rescorla-Wagner equations

The Rescorla-Wagner equations have recently turned out to be of considerable interest for understanding child language acquisition, see, for instance, Ramscar and Yarlett (2007); Ramscar, Yarlett, Dye, Denny, and Thorpe (2010); Ramscar, Dye, Popick, and O'Donnell-McCarthy (2011). For corpus linguistics, the equilibrium equations for the Rescorla-Wagner equations developed by Danks (2003) are of key interest. Danks was able to derive a set of equations that define the association strengths (weights) from cues to outcomes for the situation in which these strengths no longer change, i.e., for the adult state of the learner.

It can be shown that when

V_i^{t+1} = V_i^t,    (3)

or, equivalently,

V_i^{t+1} - V_i^t = 0,    (4)

the weights to the outcomes can be estimated by solving the following set of equations, with W the matrix of unknown weights:

\mathbf{C}\mathbf{W} = \mathbf{O}.    (5)

Equation (5) is formulated using notation from matrix algebra. The following example illustrates the principle of the calculations involved:

\begin{pmatrix} a & b \\ c & d \end{pmatrix}
\begin{pmatrix} v & w \\ x & y \end{pmatrix} =
\begin{pmatrix} av + bx & aw + by \\ cv + dx & cw + dy \end{pmatrix}.

In (5), C is the matrix of conditional probabilities of the cues. It is obtained by first calculating the matrix M listing the frequencies with which the cues co-occur (6):

M              indefinite  pronominal  nonpronominal  definite
indefinite         35           1           34            0
pronominal          1           6            0            5
nonpronominal      34           0           44           10
definite            0           5           10           15

As can be verified by inspecting Table 1, the cue indefinite occurs 35 times, the combination of indefinite and pronominal occurs once, indefinite co-occurs 34 times with non-pronominal, and so on. From this matrix, we derive the matrix of conditional probabilities of cue j given cue i (7):

C              indefinite  pronominal  nonpronominal  definite
indefinite        0.50        0.01         0.49         0.00
pronominal        0.08        0.50         0.00         0.42
nonpronominal     0.39        0.00         0.50         0.11
definite          0.00        0.17         0.33         0.50

The probability of indefinite given indefinite is 35/(35 + 1 + 34 + 0) = 0.5, that of indefinite given pronominal is 1/(1 + 6 + 0 + 5) = 0.083, and so on. The matrix W is the matrix of association strengths (weights) from cues (rows) to outcomes (columns) that we want to estimate. Finally, the matrix O (8),

O              NP NP   NP PP
indefinite      0.41    0.09
pronominal      0.17    0.33
nonpronominal   0.40    0.10
definite        0.27    0.23

lists the conditional probabilities of the constructions (columns) given the cues (rows). It is obtained from the co-occurrence matrix of cues (M) and the co-occurrence matrix of cues and constructions, N (9):

N              NP NP   NP PP
indefinite       29      6
pronominal        2      4
nonpronominal    35      9
definite          8      7

For instance, the probability of the double object construction given (i) the indefinite cue is 29/(35 + 1 + 34 + 0) = 0.414, and given (ii) the pronominal cue it is 2/(1 + 6 + 0 + 5) = 0.167.

The set of equations (5) can be solved using the generalized inverse, which will yield a solution that is optimal in the least-squares sense, resulting in the weight matrix W (10):

W              NP NP   NP PP
indefinite      0.38    0.12
definite        0.19    0.31
nonpronominal   0.46    0.04
pronominal      0.11    0.39

The support for the two constructions given a set of input cues is obtained by summation over the association strengths (weights) of the active cues in the input. For instance, for indefinite non-pronominal themes, the summed support for the NP NP construction is 0.38 + 0.46 = 0.84, while the support for the NP PP construction is 0.12 + 0.04 = 0.16. Hence, the probability of the double object construction equals 0.84/(0.84 + 0.16) = 0.84, and that of the prepositional object construction is 0.16. (In this example, the two measures of support sum to one, but this is not generally the case for more complex data sets.) Table 2 lists the resulting probabilities for all four combinations of definiteness and pronominality.

Table 2: Probabilities of the two constructions following from the equilibrium equations for the Rescorla-Wagner model.

         indefinite        indefinite   definite          definite
         non-pronominal    pronominal   non-pronominal    pronominal
NP NP    0.84              0.49         0.65              0.30
NP PP    0.16              0.51         0.35              0.70
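The worked example can be checked directly. The sketch below assumes only base R plus the MASS package for the generalized inverse; object names follow the notation of Equations (6) through (10). It builds M, N, C, and O from Table 1, solves CW = O, and recovers the construction probabilities of Table 2.

```r
library(MASS)   # for ginv(), the Moore-Penrose generalized inverse

cues          <- c("indefinite", "pronominal", "nonpronominal", "definite")
constructions <- c("NP NP", "NP PP")

# cue-cue co-occurrence frequencies, Equation (6)
M <- matrix(c(35,  1, 34,  0,
               1,  6,  0,  5,
              34,  0, 44, 10,
               0,  5, 10, 15),
            4, 4, byrow = TRUE, dimnames = list(cues, cues))

# cue-construction co-occurrence frequencies, Equation (9)
N <- matrix(c(29, 6,
               2, 4,
              35, 9,
               8, 7),
            4, 2, byrow = TRUE, dimnames = list(cues, constructions))

C <- M / rowSums(M)   # conditional probabilities of cue j given cue i, Equation (7)
O <- N / rowSums(M)   # conditional probabilities of construction given cue, Equation (8)

# equilibrium weights (Danks, 2003): solve C W = O in the least-squares sense
W <- ginv(C) %*% O
dimnames(W) <- list(cues, constructions)
round(W, 2)           # should match the weight matrix in Equation (10), up to rounding

# summed support for the four theme types, normalized to the probabilities of Table 2
support <- rbind(
  "indefinite non-pronominal" = W["indefinite", ] + W["nonpronominal", ],
  "indefinite pronominal"     = W["indefinite", ] + W["pronominal", ],
  "definite non-pronominal"   = W["definite", ]   + W["nonpronominal", ],
  "definite pronominal"       = W["definite", ]   + W["pronominal", ])
round(support / rowSums(support), 2)
```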
One can think of the weights as being chosen in such a way that, given the co-occurrences of cues and outcomes, the probability of a construction given the different cues in the input is optimized. We can view this model as providing a re-representation of the data: eight frequencies (see Table 1) have been replaced by eight weights, representing 50 trials of learning. The model does not work with exemplars; nevertheless, its weights do reflect exemplar frequencies. For instance, the probabilities of the double object construction in Table 2 are correlated with the original frequencies (rs = 0.94, p = 0.051). It is worth noting that the probabilities in Table 2 are obtained with a model that is completely driven by the input, and that is devoid of free parameters: the learning parameters of the Rescorla-Wagner equations (2) drop out of the equilibrium equations.

Baayen, Milin, Filipovic Durdjevic, Hendrix, and Marelli (2011) made use of discriminative learning to model visual lexical decision and self-paced reading latencies in Serbian and English. They obtained excellent fits to empirical latencies, both in terms of good correlations at the item level and in terms of the relative importance and effect sizes of a wide range of lexical distributional predictors. Simulated latencies correctly reflected morphological family size effects as well as whole-word frequency effects for complex words, without any complex words being represented in the model as individual units. Their model also predicts word n-gram frequency effects (see also Baayen & Hendrix, 2011). It provides a highly parsimonious account of morphological processing, both in terms of the representations it assumes and in terms of the extremely limited number of free parameters that it requires to fit the data. For monomorphemic words, the model is essentially parameter-free, as in the present example for the dative alternation.

Baayen et al. (2011) refer to the present approach as naive discriminative learning, because the probability of a given outcome is estimated independently of all other outcomes. This is a simplification, but thus far it seems that this simplification does not affect performance much, just as is often observed for naive Bayes classifiers, while it makes it possible to obtain model predictions without having to simulate the learning process itself. The question to which we now turn is to what extent naive discriminative learning provides a good fit to corpus data.
If the model provides decent fits, then, given that it is grounded in well-established principles of human learning, and given that it performs well in simulations of human processing costs at the lexical level, we can compare discriminative learning with well-established statistical methods in order to answer the question of whether human learning is comparable, superior, or inferior to machine learning. We explore this issue through a more comprehensive analysis of the dative alternation data.

Predicting the dative alternation

From the dative dataset in the languageR package (Baayen, 2009), the subset of data points extracted from the Switchboard corpus was selected for further analysis. For this subset of the data, information about the speaker is available. In what follows, the probability of the prepositional object construction is taken as the response variable. Software for naive discriminative classification is available in the ndl package for R, available at www.r-project.org. Example code is provided in the appendix.

Prediction accuracy

A discriminative learning model predicting the construction (double object versus prepositional object) was fitted with the predictors Verb, Semantic Class, and the Animacy, Definiteness, Pronominality, and Length of recipient and theme. As the model currently requires discrete cues, the length of recipient and theme was, as a workaround, split into three ranges: length 1, lengths 2-4, and lengths exceeding 4. These three length levels were used as cues instead of the original numerical values. As Bresnan et al. (2007) did not observe significant by-speaker variability, speaker was not included as a predictor in our initial model. (Models including speaker as a predictor will be introduced below.)

To evaluate goodness of fit, we used two measures: the index of concordance C and the model's accuracy. The index of concordance C is also known as the area under the receiver operating characteristic curve (see, e.g., Harrell, 2001). Values of C exceeding 0.8 are generally regarded as indicative of a successful classifier. Accuracy was defined here as the proportion of correctly predicted constructions, with as cut-off criterion for a correct prediction that the probability of the correct construction exceed 0.5. According to these measures, the naive discriminative learning model performed well, with C = 0.97 and an accuracy of 0.92.

To place the performance of naive discriminative learning (NDL) in perspective, we compared it with memory-based learning (MBL), logistic mixed-effects regression (GLMM), and a support vector machine with a linear kernel (SVM). The index of concordance obtained with MBL, using TiMBL version 6.3 (Daelemans, Zavrel, Sloot, & Bosch, 2010), was C = 0.89. Its accuracy was 0.92. TiMBL was supplied with speaker information. A logistic mixed-effects regression model, fitted with the lme4 package for R (D. Bates & Maechler, 2009), with both Speaker and Verb as random-effect factors did not converge. As the GLMM did not detect significant speaker-bound variance, we therefore fitted a model with Verb as the only random-effect factor, including the length of theme and recipient as (numerical) covariates. The index of concordance for this model was C = 0.97; accuracy was 0.93. The regression model required 18 parameters (one random-effect standard deviation, an intercept, and 16 coefficients for slopes and contrasts) to achieve this fit.
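As an illustration of the model specification just described, the following sketch fits the initial naive discriminative learning model to the Switchboard subset of the dative data. It assumes the dative data frame of the languageR package and the ndlClassify formula interface of the ndl package; the column names and the exact call are quoted from memory and should be checked against the package documentation and the appendix of the original study, so this is an approximation rather than a replication.

```r
library(languageR)   # the dative data set (Bresnan et al., 2007)
library(ndl)         # naive discriminative classification

# Switchboard subset: the rows for which speaker information is available
spoken <- droplevels(subset(dative, !is.na(Speaker)))

# the model requires discrete cues, so the two length predictors are binned
# into the three ranges used in the text: 1, 2-4, and more than 4
spoken$LengthOfThemeBin     <- cut(spoken$LengthOfTheme,     c(0, 1, 4, Inf),
                                   labels = c("1", "2-4", ">4"))
spoken$LengthOfRecipientBin <- cut(spoken$LengthOfRecipient, c(0, 1, 4, Inf),
                                   labels = c("1", "2-4", ">4"))

# initial model: Verb, semantic class, and animacy, definiteness, pronominality,
# and (binned) length of recipient and theme; Speaker is not yet included
dative.ndl <- ndlClassify(
  RealizationOfRecipient ~ Verb + SemanticClass +
    AnimacyOfRec + DefinOfRec + PronomOfRec + LengthOfRecipientBin +
    AnimacyOfTheme + DefinOfTheme + PronomOfTheme + LengthOfThemeBin,
  data = spoken)

summary(dative.ndl)   # model summary with goodness-of-fit statistics
```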
A support vector machine, provided with access to Speaker information and fitted with the svm function in the e1071 package for R (Dimitriadou, Hornik, Leisch, Meyer, & Weingessel, 2009), yielded C = 0.97 with accuracy at 0.93, requiring 524 support vectors. From this comparison, naive discriminative learning emerges as more or less comparable in classificatory accuracy to existing state-of-the-art classifiers. It is outperformed in both C and accuracy only by the support vector machine, currently the best-performing classifier available. We note here that the NDL classifier used here is completely parameter-free: the weights are fully determined, and only determined, by the corpus input. There are no choices that the user could make to influence the results.

Since speaker information was available to TiMBL and to the SVM, we fitted a second naive discriminative learning model to the data, this time including speaker as a predictor. The index of concordance increased slightly to 0.98, and accuracy to 0.95. Further improvement can be obtained by allowing pairs of predictor values to function as cues, following the naive discriminative reader model of Baayen et al. (2011). They included both letters and letter bigrams as cues, the former representing static knowledge of which letters are present in the input, the latter representing information about sequences of letters. Analogously, pairs of features, e.g., semantic class p combined with a given theme, can be brought into the learning process. This amounts to considering, when calculating the conditional co-occurrence matrix C, not only pairwise co-occurrences of cues, but also the co-occurrences of triplets and quadruplets of cues. Within the framework of naive discriminative learning, this is the functional equivalent of interactions in a regression model. In what follows, NDL-2 refers to a model which includes pairs of features for all predictors, excluding, however, pairs involving Verb or Speaker. With this richer representation of the input, the index of concordance increased to 0.99 and accuracy to 0.96.

However, we now need to assess whether naive discriminative learning achieves this good performance at the cost of overfitting. To assess this possibility, we made use of 10-fold cross-validation, using exactly the same folds for each of the classifiers. The right half of Table 3 summarizes the results.

Table 3: Index of concordance C and accuracy for all data (left) and averaged across 10-fold cross-validation (right).

                         all data             10-fold cross-validation
                         C      Accuracy      C      Accuracy
SVM                      0.98   0.95          0.95   0.91
TiMBL                    0.89   0.92          0.89   0.92
GLMM                     0.97   0.93          0.96   0.92
NDL (verb)               0.97   0.92          0.89   0.85
NDL (verb+speaker)       0.98   0.95          0.93   0.89
NDL-2 (verb+speaker)     0.99   0.96          0.94   0.91

In cross-validation, naive discriminative learning performs less well than the SVM and the GLMM, but similarly to TiMBL. Fortunately, concordance and accuracy remain high. We are now in a position to tentatively answer our first question, of whether machine learning outperforms human learning. If naive discriminative learning is indeed a reasonable approximation of human learning, then the answer is that human learning builds a representation of past experience comparable to that of other machine learning techniques. However, for generalization to unseen, new data, human classification seems thus far to be outperformed, albeit only slightly, by some of the best machine classifiers currently available.
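The two evaluation measures and the cross-validation scheme can also be made explicit in code. The sketch below defines the index of concordance C (the proportion of concordant pairs, equivalent to the area under the ROC curve) and accuracy at the 0.5 cut-off, and sets up the 10-fold cross-validation in which one fixed fold assignment is reused for every classifier. The function fit.and.predict is a hypothetical placeholder for whichever classifier (NDL, GLMM, SVM, or TiMBL) is being evaluated; it is assumed to return the predicted probability of the prepositional object construction for the held-out fold.

```r
library(languageR)
spoken <- droplevels(subset(dative, !is.na(Speaker)))   # Switchboard subset

# index of concordance C and accuracy, given predicted probabilities p of the
# prepositional object construction and observed realizations y ("PP" / "NP")
concordance.and.accuracy <- function(p, y, positive = "PP") {
  is.pp <- y == positive
  pairs <- outer(p[is.pp], p[!is.pp], function(a, b) (a > b) + 0.5 * (a == b))
  c(C = mean(pairs),                       # proportion of concordant pairs (AUC)
    accuracy = mean((p > 0.5) == is.pp))   # correct predictions at the 0.5 cut-off
}

# 10-fold cross-validation with the same fold assignment for all classifiers
set.seed(1)
folds <- sample(rep(1:10, length.out = nrow(spoken)))

cv <- sapply(1:10, function(k) {
  train <- spoken[folds != k, ]
  test  <- spoken[folds == k, ]
  p <- fit.and.predict(train, test)   # hypothetical wrapper: returns P(PP) for the test fold
  concordance.and.accuracy(p, test$RealizationOfRecipient)
})
rowMeans(cv)   # average C and accuracy across the ten folds, as reported in Table 3
```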
Effect sizes and variable importance

One of the advantages of regression models for linguistic analysis is that the estimated coefficients offer the researcher insight into what forces shape the probabilities of a construction. For instance, a pronominal theme is assigned a β weight of 2.2398 on the log odds scale (an odds ratio of roughly e^2.24 ≈ 9.4), indicating that pronominal themes are much more likely to be expressed in a prepositional object construction than in a double object construction. This kind of information is more difficult to extract from a support vector machine or from a memory-based model, for which one has to inspect the support vectors or the similarity neighborhoods, respectively. Interestingly, the weights of the

[Figure: treatment coefficients of the GLMM plotted against the contrasts predicted by naive discriminative learning.]
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2011